Feature subset selection using a new definition of classifiability

Authors

  • Ming Dong
  • Ravi Kothari
Abstract

The performance of most practical classifiers improves when correlated or irrelevant features are removed. Machine-based classification is thus often preceded by subset selection, a procedure that identifies the relevant features of a high-dimensional data set. At present, the most widely used subset selection technique is the so-called "wrapper" approach, in which a search algorithm identifies candidate subsets and the actual classifier is used as a "black box" to evaluate the fitness of each subset. Evaluating the fitness of a subset, however, requires cross-validation or another resampling-based procedure for error estimation, which means constructing a large number of classifiers for each subset. This significant computational burden makes the wrapper approach impractical when a large number of features are present. In this paper, we present an approach to subset selection based on a novel definition of the classifiability of a given data set. The classifiability measure we propose characterizes the relative ease with which some labeled data can be classified. We use this definition of classifiability to systematically add the feature that leads to the largest increase in classifiability. The proposed approach does not require the construction of classifiers at each step and therefore does not suffer from as high a computational burden as a wrapper approach. Our results over several different data sets indicate that the results obtained are at least as good as those obtained with the wrapper approach. © 2002 Elsevier Science B.V. All rights reserved.
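The greedy procedure described in the abstract (repeatedly add the feature that most increases a classifiability score, without training a classifier per subset) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the authors' classifiability measure, so `neighbor_agreement` below is a hypothetical stand-in (the fraction of points whose nearest neighbour shares their label), and the function names, the stopping rule, and the toy data are likewise assumptions rather than the paper's method.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Score = Callable[[Sequence[Row], Sequence[int], List[int]], float]


def neighbor_agreement(X: Sequence[Row], y: Sequence[int], features: List[int]) -> float:
    """Stand-in score (NOT the paper's measure): fraction of points whose
    nearest neighbour, measured on the chosen features, has the same label."""
    if not features:
        return 0.0
    hits = 0
    for i in range(len(X)):
        best_j, best_d = -1, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_d, best_j = d, j
        hits += int(y[i] == y[best_j])
    return hits / len(X)


def forward_select(X: Sequence[Row], y: Sequence[int], score: Score) -> Tuple[List[int], float]:
    """Greedy forward selection: at each step add the single feature that
    raises the score the most; stop when no remaining feature improves it."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    best = score(X, y, selected)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(X, y, selected + [f]), f) for f in remaining]
        top_score, top_f = max(candidates, key=lambda t: (t[0], -t[1]))
        if top_score <= best:  # assumed stopping rule: no strict improvement
            break
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected, best


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
selected, best = forward_select(X, y, neighbor_agreement)
print(selected, best)
```

Because the score is passed in as a function, any measure with the same signature could be substituted for the stand-in; each step costs one score evaluation per remaining feature rather than a full cross-validated classifier per candidate subset.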

Similar resources

A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum

Introduction: A major problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reaction within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting the proteomic patterns from biological samples. A major challenge in extracting such ...

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

A New Framework for Distributed Multivariate Feature Selection

Feature selection is considered an important issue in the classification domain. Selecting good features by the criteria of maximum relevance to the class label and minimum redundancy among features improves classification accuracy. However, most current feature selection algorithms work only in a centralized setting. In this paper, we suggest a distributed version of the mRMR featu...

A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems

Intrusion detection systems are designed to provide security in computer networks, so that if an attacker gets past other security devices, they can detect and prevent the attack. One of the most essential challenges in designing these systems is the so-called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...

Journal:
  • Pattern Recognition Letters

Volume 24

Publication year 2003